Load all required libraries

library(ggplot2)
library(dplyr)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)

Load and explore the dataset

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
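The chunk that produced the output above is not shown in the report. A minimal sketch of what it might look like follows. The CSV file name is an assumption (it is not given anywhere in the text), so the read.csv() call is left commented out, and a stand-in data frame `wine10` is rebuilt from the first ten rows printed by str() so that the sketch runs on its own. Density is left out because str() truncates its values.

```r
# Assumption: the data live in a CSV file; the file name is a guess.
# wine <- read.csv("wineQualityReds.csv")

# Stand-in: the first ten rows, copied from the str() output above.
wine10 <- data.frame(
  X                    = 1:10,
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,
                           0.065, 0.073, 0.071),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  pH                   = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine10)  # column names
str(wine10)    # types and the first few values of each column
```

With the full data loaded as `wine`, the same two calls applied to `wine` should give the listing above.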

Let’s get a quick summary of each variable.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

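The dataset contains 1599 observations of 13 columns: a row index (X), 11 quantitative input variables, and the output variable, quality. Quality is stored as an integer score but is treated here as an ordered categorical (qualitative) variable.

Analyze the output variable (quality) distribution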

From the summary and the quality histogram, the quality distribution looks roughly normal, with a mean of 5.636 and a median of 6. There are no records for quality 9 or 10, so quality 8 is the highest grade of red wine in the dataset; the lowest is 3. Most data points have quality 5 or 6, and very few have any other grade.
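The counts per grade can be checked directly with table(). The sketch below is an illustration only: it uses the ten quality values printed by str() earlier, so the counts are not those of the full dataset. With the full data, the same calls would be applied to wine$quality.

```r
# First ten quality scores, copied from the str() output
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Fix the levels to the grades that occur in the full data (3 to 8, per the
# summary) so that empty grades show up as zero counts
counts <- table(factor(quality, levels = 3:8))
counts

# Share of each grade
round(prop.table(counts), 2)
```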

It’d be interesting to see how each variable affects the quality of wine.

Univariate and Bivariate Analysis

Let’s study the impact of each variable on quality in depth, based on the description and the data.

Impact of each variable on quality

Fixed.acidity

Let’s analyze the variable distribution.

Fixed acidity has a slightly right-tailed distribution, with some outliers above 14 g/dm^3.

On a log scale the distribution looks closer to normal. Now, let’s analyze the impact of fixed.acidity on quality.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.225  12.600
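The layout of the output above matches what by() prints when summary() is applied per group. A minimal sketch, run on the ten rows printed by str() earlier rather than on the full data (so the numbers differ from those above):

```r
# First ten rows of the two columns involved, copied from the str() output
wine10 <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Six-number summary of fixed.acidity within each quality grade
by(wine10$fixed.acidity, wine10$quality, summary)

# Just the medians, as a named vector
med <- tapply(wine10$fixed.acidity, wine10$quality, median)
med
```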

Fixed acidity lies in similar ranges for all quality grades of red wine. It’d be helpful to visualize the distributions with boxplots and scatter plots.

Fixed.acidity lies in similar ranges across grades; in the box plots, quality 8 and quality 3 look very similar. Fixed.acidity doesn’t appear to have any impact on quality.

As can be seen from the above plot, there are a lot of data points for quality 5 and 6 and very few for quality 3, 4 and 8. The histograms for quality 5 and 6 also look quite similar.
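The plotting code is not shown in the report. The sketch below is one way such plots could be drawn with ggplot2; the bin count, jitter width and labels are illustrative choices, not the author's settings, and the data are the ten rows printed by str() earlier.

```r
library(ggplot2)

# First ten rows of the two columns involved, copied from the str() output
wine10 <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Box plot per quality grade, with the raw points overlaid as a scatter
p_box <- ggplot(wine10, aes(x = factor(quality), y = fixed.acidity)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(x = "quality", y = "fixed.acidity (g/dm^3)")

# Histogram on a log10 x axis, one panel per quality grade
p_hist <- ggplot(wine10, aes(x = fixed.acidity)) +
  geom_histogram(bins = 10) +
  scale_x_log10() +
  facet_wrap(~ quality)

print(p_box)
print(p_hist)
```

The same two plot definitions can be reused for the other variables by swapping the column name in aes().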

Volatile.acidity

The distribution seems to be bimodal, with peaks around 0.4 and 0.6. Now, let’s analyze its impact on quality.

It looks like the higher the volatile.acidity, the lower the quality, and vice versa. This agrees with the description: a higher amount of acetic acid in wine leads to an unpleasant, vinegar taste. This is one of the good indicators of wine quality. Although some lower-quality wines have low volatile.acidity, those data points could be affected by other variables.

In this histogram, too, we can see few data points for quality 3, 4 and 8 and many for 5 and 6. We can also see the gradual shift of the median between quality 5, 6 and 7.
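A shift in the median can also be read off numerically instead of from the plots. The sketch below computes per-grade medians for several variables at once with aggregate(). It runs on the ten rows printed by str() earlier, so it only illustrates the call and says nothing about the trend in the full data.

```r
# First ten rows of the columns involved, copied from the str() output
wine10 <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One row per quality grade, one median column per variable
med <- aggregate(cbind(volatile.acidity, citric.acid, alcohol) ~ quality,
                 data = wine10, FUN = median)
med
```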

Citric.acid

The citric acid distribution gradually decreases from 0 to 0.9. Let’s analyze its impact on quality.

In general, higher citric.acid seems to be good for wine; 0.3 to 0.53 seems to be a good amount of citric.acid.

The citric acid distribution seems to be bimodal for quality 5, 6 and 7. Also, the second peak seems to shift to the right, from 0.25 for quality 5 to 0.4 for quality 7.

Residual.sugar

Let’s analyze the histogram of residual sugar in the wine.

Residual sugar has a right-tailed distribution, with the majority of data points between 0 and 4. Let’s analyze its impact on quality.

Residual sugar seems to be at similar concentrations for each quality grade. There are no sweet wines (sugar concentration greater than 45 grams/liter) in this dataset; the maximum residual.sugar value is 15.5. It’d be interesting to see how sweet red wines are rated.

As seen before, most of the data points are for quality 5 and 6. Both of these histograms look roughly normal, with a peak at 2.

Chlorides

Let’s visualize the chloride content distribution.

Chloride content is very low, with a roughly normal distribution and a peak at 0.075. Let’s look at its impact on quality.

The median seems to shift downward from quality 3 to 8. However, there is a huge overlap of the interquartile ranges across all quality grades. Also, the range narrows from lower to higher quality, which may be due to the lack of data points.

Most data points across different quality grades lie below 0.2, with a peak at 0.1.

Free.sulfur.dioxide

Let’s visualize the free sulfur dioxide content distribution.

Free sulfur dioxide seems to have a bimodal distribution with peaks at 6 and 12. Let’s analyze its impact on quality.

This plot is odd: the median increases from quality 3 to 5 and then decreases from quality 5 to 8. Maybe the histogram can help analyze the distribution.

The odd shift in the median seems to be due to the lack of data points. The histograms for quality 5 and 6 look similar. Let’s analyze the histogram with the y scale changed to log.

In the above plot, we can clearly see the counts in different bins of free sulfur dioxide values across different quality grades. There are very few data points for quality 3, 4 and 8, so we cannot rely on those distributions. For quality 5, 6 and 7, however, there doesn’t seem to be any shift in trend.

Also, as per the description, free SO2 above 50 ppm becomes evident in taste, so it should affect wine quality negatively. Unfortunately, there are not many data points with free SO2 > 50 ppm, so we can’t confirm this statement with the available data. And within the free SO2 < 50 ppm range, given the above observations, we can’t identify any relation between free sulfur dioxide concentration and quality.
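How many wines exceed the 50 ppm threshold, and in which grades, can be counted directly. A minimal sketch on the ten free.sulfur.dioxide values printed by str() earlier (none of those ten exceed the threshold; the full data reach a maximum of 72 according to the summary):

```r
# First ten rows of the two columns involved, copied from the str() output
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
quality  <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Flag wines above the taste threshold mentioned in the description
above_50 <- free_so2 > 50

sum(above_50)                                       # how many wines in total
table(quality = quality, free_so2_above_50 = above_50)  # split by grade
```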

Total.sulfur.dioxide

Let’s visualize the total sulfur dioxide content distribution.

Total sulfur dioxide seems to have a normal distribution on a log scale. Let’s analyze its impact on quality.

Again, we can see a trend in the median similar to the one above. This can be very misleading. Let’s look at the histogram.

As expected, due to the lack of data points, the trend we saw in the boxplots is invalid. Neither free sulfur dioxide nor total sulfur dioxide seems to have an impact on quality. We are not able to identify the best range of values for these variables because of the lack of data points and the overlap of ranges. My guess is that even a small quantity is enough to prevent microbial growth and oxidation.

Density

Let’s visualize the density distribution.

Density seems to have a normal distribution. Let’s analyze its impact on quality.

From the above boxplots, lower density seems to be better, but there is a huge overlap of data points across all quality grades. Let’s look at the histogram.

We can’t use quality 3, 4 and 8 due to insufficient data points. However, we can still see the downward shift in the median across quality grades 5, 6 and 7.

pH

Let’s visualize the pH distribution of the wines in this dataset.

pH seems to have a normal distribution with a peak at 3.3. Let’s analyze its impact on quality.

There is significant overlap of pH values across all quality grades. However, there seems to be a small downward trend in median pH: lower pH seems to relate to higher quality. Specifically, 3 to 3.5 seems to be a good range for quality wine.

For quality 5, 6 and 7, we can see the small downward trend in pH values for better wine quality. While lower pH seems to be better, I wouldn’t expect very acidic wine to be pleasant. With more data points across quality grades we could probably have identified the best range of pH for wine.

Sulphates

Let’s visualize the sulphate distribution.

Sulphates seem to have a right-tailed distribution. Let’s analyze their impact on quality.

Sulphates contribute to SO2, which acts as an antimicrobial and antioxidant. However, SO2 itself was identified as having no significant impact on quality. In contrast, higher sulphate concentrations seem to lead to higher quality. Specifically, 0.7 to 0.8 units seems to be a good concentration for quality wine. But can we trust the range from this plot? Not yet; let’s look at the histogram.

There are very few data points. The best range that we determined from quality 8 is not reliable because of the lack of data points. However, from quality 5, 6 and 7, the range seems to be 0.5 to 1.

Alcohol

Let’s visualize the alcohol content distribution.

Alcohol content in this dataset seems to have a right-tailed distribution. Let’s analyze its impact on quality.

From the above graphs, a higher alcohol percentage seems to be good for wine. Quality rises quite discernibly with increasing median alcohol content (except for quality 5, which could be affected by other variables).

The above observation can be confirmed from the histograms for quality 5, 6 and 7.

Summary of Univariate analysis

From the above univariate analysis, citric acid, sulphates and alcohol concentration seem to be positively related to quality, while volatile acidity, density and pH seem to be negatively related to it. Correlation could be a good metric to rank the influence of these factors on quality.

with(wine, cor(quality, alcohol))
## [1] 0.4761663
with(wine, cor(quality, volatile.acidity))
## [1] -0.3905578
with(wine, cor(quality, sulphates))
## [1] 0.2513971
with(wine, cor(quality, citric.acid))
## [1] 0.2263725
with(wine, cor(quality, density))
## [1] -0.1749192
with(wine, cor(quality, pH))
## [1] -0.05773139
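The six calls above can be collapsed into one step that computes and ranks all correlations. The sketch below also reports Spearman's rank correlation next to Pearson's; that is an addition of this sketch, suggested only because quality is an ordinal score, and is not part of the original analysis. It runs on the ten rows printed by str() earlier, so the values differ from those above.

```r
# First ten rows of the columns involved, copied from the str() output
wine10 <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

vars <- setdiff(names(wine10), "quality")

# Correlation of every input variable with quality
r_pearson  <- sapply(wine10[vars], cor, y = wine10$quality)
r_spearman <- sapply(wine10[vars], cor, y = wine10$quality, method = "spearman")

# Rank by absolute Pearson correlation, strongest first
ranked <- r_pearson[order(abs(r_pearson), decreasing = TRUE)]
round(cbind(pearson = ranked, spearman = r_spearman[names(ranked)]), 3)
```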

We analyzed each variable against quality using box plots and histograms. Many variables showed significant overlap in their distributions between quality grades. While we faceted the histograms by quality, it’d be good to analyze all grades together on the same plot, distinguished by color. This can give us insight into patterns across quality grades for each variable.

Analyze distribution of variables with quality visually.

In the dataset, there are very few data points for quality 3, 4 and 8. While we compare the patterns for all quality grades, it’d be good to focus on patterns within variables for quality grades 5, 6 and 7.

Now, let’s analyze patterns within each variable for quality 5, 6 and 7.

All quality grades overlap significantly over the range of the variable. This graph will not help deduce any relationship between the variable and quality.

Also, it looks like we are repeating the analysis we did with boxplots, since this is just another way to plot the data by changing axes. As such, we can plot all variables together and look for any new patterns.

From the above graphs, we can conclude the following.
* Variables with positive relation to quality: Fixed acidity, citric acid, sulphates and alcohol.
* Variables with negative relation to quality: Volatile acidity, density. Also, pH seems to be showing a small negative trend.

Let’s enhance the peaks by plotting on a log scale to see the shifts in peaks.

The log-scale plots are more helpful in confirming the trends observed above.

Now, let’s try to find the best range of each variable for the best quality wine. However, let’s keep in mind that there are too few data points to establish any significant patterns.

The plots look quite uniform due to the lack of data points for quality grade 8. It’d be good to consider quality grades 7 and 8 together as higher quality.

The above distributions of variables for quality grades 7 and 8 help narrow the ranges.
* Fixed acidity seems to be better between 7 and 10 units.
* Volatile acidity seems to be better between .25 and .5 units.
* Citric acid concentration seems to be better between 0.3 and 0.5 units.
* Residual sugars are better between 2 and 2.5 units.
* Chlorides are better between 0.05 and 0.8 units.
* Free sulfur dioxide seems to be decaying down from 5 to 40 with good range between 5 and 15.
* Total sulfur dioxide seems to be decaying down from 10 to 100 with good range between 10 and 40.
* Density seems to be better between 0.994 to 0.998.
* pH seems to be better between 3.2 and 3.4.
* Sulphates seems to be good between 0.6 and 0.9.
* Alcohol content seems to be good between 10 and 12.
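The ranges above were read from plots. One way to get comparable numbers from the data is to pool grades 7 and 8 and take the interquartile range of each variable. Using the interquartile range as the definition of a "good range" is an assumption of this sketch, not the author's method, and the ten rows used here (from the str() output) are far too few to reproduce the figures in the list.

```r
# First ten rows of a few columns, copied from the str() output
wine10 <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pool the higher grades (7 and 8) into one group
high <- subset(wine10, quality >= 7)
vars <- setdiff(names(high), "quality")

# 25th and 75th percentile of each variable within the pooled group
ranges <- sapply(high[vars], quantile, probs = c(0.25, 0.75))
round(ranges, 3)
```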

Multivariate analysis

From the univariate and bivariate analysis, we identified a set of variables which impact quality. Let’s analyze how these variables affect each other.

From the pairwise investigation, the following are observed.
* The above variables were identified by the univariate analysis as impacting quality. However, none of them has a strong correlation with quality.
* Fixed.acidity seems to be negatively correlated with volatile.acidity and pH.
* Also, fixed.acidity is positively correlated with citric.acid and density.
* While fixed.acidity is correlated with all factors affecting quality, it surprisingly has no significant impact on quality.
* Alcohol content seems to be negatively correlated with density.
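The pairwise plot itself is not shown; GGally is loaded at the top of the report, so it may have come from ggpairs(), but that is a guess. The numeric part of such an investigation can be sketched with base R's cor(). The 0.5 cut-off below is an arbitrary illustrative threshold, density is omitted because its values are truncated in the str() output, and the data are the ten rows printed there.

```r
# First ten rows of the columns involved, copied from the str() output
wine10 <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Full pairwise Pearson correlation matrix
cm <- cor(wine10)
round(cm, 2)

# List each pair once (upper triangle) whose |r| exceeds the cut-off
idx <- which(abs(cm) > 0.5 & upper.tri(cm), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = round(cm[idx], 2)
)
strong
```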

Fixed acidity vs citric acid vs quality

From the above observations, let’s explore the related variables.

From the above graph, fixed.acidity and citric.acid seem to have a linear relationship. However, the ratio seems to be similar across different quality grades. Let’s analyze the ratio of citric.acid to fixed.acidity across quality using box plots.

The above graph shows that although the ratio increases with quality, it is very similar across grades and changes only marginally.

Fixed.acidity vs density vs quality

Similarly, fixed.acidity and density seem to have a linear relationship, but the ratio is similar across different quality grades.

Fixed.acidity vs volatile.acidity vs quality

No discernible relationship between fixed.acidity and volatile.acidity.

Fixed.acidity vs pH vs quality

Although negative, fixed.acidity and pH seem to have a linear relationship. But the ratio seems to be similar across different quality grades.

Alcohol vs density vs quality

Another relation observed above was that alcohol and density are somewhat complementary variables. It’d be good to investigate how the difference correlates with quality.

Again, like the other pairs investigated above, alcohol and density seem to have a negative relation, but the ratio is quite similar across quality grades.

Model using all significant variables

While the pairs of variables above gave more insight, they didn’t help much towards devising a good model to predict quality. Let’s see how the groups of positively and negatively related variables correlate with quality.

with(wine, 
     cor(quality, citric.acid+alcohol+sulphates))
## [1] 0.520615
with(wine, 
     cor(quality, volatile.acidity+pH+density))
## [1] -0.3020624
with(wine, 
     cor(quality, citric.acid+alcohol+sulphates-volatile.acidity-pH-density))
## [1] 0.5535468

The combined score of all significant variables is more correlated with quality (0.55) than any individual variable alone (at most 0.48, for alcohol).
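One caveat with summing raw values is that the variables sit on different scales: alcohol ranges from 8.4 to 14.9 while citric.acid and sulphates stay at or below 2, so alcohol dominates the sum. A hedged variant, not used in the original analysis, is to standardize each variable with scale() before combining. The sketch runs on the ten rows printed by str() earlier and leaves density out because its values are truncated there.

```r
# First ten rows of the columns involved, copied from the str() output
wine10 <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

pos <- c("citric.acid", "alcohol", "sulphates")  # positively related group
neg <- c("volatile.acidity", "pH")               # negatively related group

# z-scores: every column gets mean 0 and standard deviation 1
z <- as.data.frame(scale(wine10[c(pos, neg)]))

# Composite score: positives added, negatives subtracted, equal weights
score <- rowSums(z[pos]) - rowSums(z[neg])
cor(wine10$quality, score)
```

Equal weights are the simplest choice; a regression would instead estimate a weight per variable.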

Linear model

Let’s build a linear model for predicting the quality of wine based on above observations.

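m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, ~ . + citric.acid)
m3 <- update(m2, ~ . + sulphates)
# Note: "- term" drops a term from the formula. These three terms are not in
# m3, so m4, m5 and m6 are refits of m3 (see the Calls section below).
m4 <- update(m3, ~ . - volatile.acidity)
m5 <- update(m4, ~ . - pH)
m6 <- update(m5, ~ . - density)
mtable(m1, m2, m3, m4, m5, m6)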
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + citric.acid, data = wine)
## m3: lm(formula = quality ~ alcohol + citric.acid + sulphates, data = wine)
## m4: lm(formula = quality ~ alcohol + citric.acid + sulphates, data = wine)
## m5: lm(formula = quality ~ alcohol + citric.acid + sulphates, data = wine)
## m6: lm(formula = quality ~ alcohol + citric.acid + sulphates, data = wine)
## 
## ======================================================================================================
##                        m1            m2            m3            m4            m5            m6       
## ------------------------------------------------------------------------------------------------------
##   (Intercept)         1.875***      1.830***      1.434***      1.434***      1.434***      1.434***  
##                      (0.175)       (0.171)       (0.176)       (0.176)       (0.176)       (0.176)    
##   alcohol             0.361***      0.346***      0.338***      0.338***      0.338***      0.338***  
##                      (0.017)       (0.016)       (0.016)       (0.016)       (0.016)       (0.016)    
##   citric.acid                       0.730***      0.513***      0.513***      0.513***      0.513***  
##                                    (0.090)       (0.093)       (0.093)       (0.093)       (0.093)    
##   sulphates                                       0.814***      0.814***      0.814***      0.814***  
##                                                  (0.107)       (0.107)       (0.107)       (0.107)    
## ------------------------------------------------------------------------------------------------------
##   R-squared           0.227         0.257         0.284         0.284         0.284         0.284     
##   adj. R-squared      0.226         0.256         0.282         0.282         0.282         0.282     
##   sigma               0.710         0.696         0.684         0.684         0.684         0.684     
##   F                 468.267       276.595       210.501       210.501       210.501       210.501     
##   p                   0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood  -1721.057     -1688.711     -1659.955     -1659.955     -1659.955     -1659.955     
##   Deviance          805.870       773.917       746.576       746.576       746.576       746.576     
##   AIC              3448.114      3385.421      3329.910      3329.910      3329.910      3329.910     
##   BIC              3464.245      3406.930      3356.795      3356.795      3356.795      3356.795     
##   N                1599          1599          1599          1599          1599          1599         
## ======================================================================================================

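Models m4 to m6 are identical to m3: in update(), a minus sign removes a term, and volatile.acidity, pH and density were never in the model, so the negatively related variables were not actually tested. Even the best model here explains only about 28% of the variance in quality (R-squared 0.284), which is too low for a meaningful linear model.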
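In the table above, m4 to m6 have exactly the same coefficients as m3. The sketch below shows why, and how the negatively related variables could be added instead: in a formula, the sign on a term controls whether it is included, not the sign of its coefficient. It runs on the ten rows printed by str() earlier (density is left out because it is truncated there), so the fitted numbers are meaningless; only the formula handling is the point.

```r
# First ten rows of the columns involved, copied from the str() output
wine10 <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

m3 <- lm(quality ~ alcohol + citric.acid + sulphates, data = wine10)

# Subtracting a term that is not in the model leaves the model unchanged
m3_minus <- update(m3, ~ . - volatile.acidity)
all.equal(coef(m3_minus), coef(m3))

# Adding the terms with "+" lets the fit decide the sign of each coefficient
m4 <- update(m3, ~ . + volatile.acidity)
m5 <- update(m4, ~ . + pH)
attr(terms(m5), "term.labels")
```

With the full data, density could be added the same way and the models compared with mtable() as before.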

Final plots and Summary

Plot 1: Output variable distribution

Observations: There are very few data points for the best and worst quality grades; most data points are mid-quality. Thus, devising a good predictive model for quality with so few data points at the extremes is very difficult.

Plot 2: Univariate and Bivariate analysis

Observations: Data points for quality grades 5 and 6 dominated every variable. Also, there seem to be either similar patterns or a huge overlap of data points across different quality grades. As such, no individual variable was good enough to build a model that predicts wine quality. However, the above plots helped to generate a range of values for good quality wine. Because data points for the best quality (8) are scarce, combining quality grades 7 and 8 helped to generate the best ranges of values for all variables.

Plot 3: Multivariate analysis

Multivariate analysis between different pairs of variables helped get better insights on how variables are related to each other. However, the pattern of this relationship remained quite similar across different quality grades. One such example is shown below.

## quality: 3
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0006024 0.0049172 0.0163683 0.0320490 0.0568965 
## -------------------------------------------------------- 
## quality: 4
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.003797 0.013235 0.020776 0.033750 0.108696 
## -------------------------------------------------------- 
## quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01270 0.02778 0.02844 0.04167 0.09870 
## -------------------------------------------------------- 
## quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01250 0.03411 0.03112 0.04507 0.13929 
## -------------------------------------------------------- 
## quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.03502 0.04476 0.04050 0.05083 0.08831 
## -------------------------------------------------------- 
## quality: 8
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.005455 0.043042 0.046917 0.043204 0.054709 0.059574

During the multivariate analysis, it was found that citric.acid and fixed.acidity are correlated quite linearly. However, this linear relationship was quite similar across quality grades. As can be seen from above, the slope/ratio of the variables changed by about 0.01 on average between quality grades. The same held for the other pairs of variables analyzed. Thus, the multivariate analysis also didn’t help to generate any good model for predicting wine quality.

However, while individual variables and pairs didn’t yield a good model, combining all significant variables produced a better one. Even so, the R squared value for this model was not good enough for predicting wine quality.
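The per-grade summary above is consistent with the ratio citric.acid / fixed.acidity summarized by quality. A minimal sketch of that computation, run on the ten rows printed by str() earlier (so the numbers differ from the full-data output above):

```r
# First ten rows of the columns involved, copied from the str() output
wine10 <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Ratio of citric acid to fixed acidity for each wine
wine10$ratio <- with(wine10, citric.acid / fixed.acidity)

# Six-number summary of the ratio within each quality grade
with(wine10, by(ratio, quality, summary))
```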

Reflection

The red wine data set contains about 1600 data points with 11 quantitative input variables. The output, quality, is an ordered categorical variable. I started by understanding the distribution of the output (quality) variable. The number of data points was far from uniform across quality grades: there were many for quality grades 5 and 6 and very few for quality grades 3, 4 and 8.

With the above insights about quality in mind, I moved on to exploring each individual variable for different quality grades. As expected from the data set description, alcohol content, sulphates and citric acid seemed to have a positive relation with quality. On the other hand, volatile.acidity, pH and density seemed to show a negative relationship with quality. One interesting discovery was that while fixed.acidity is positively related to variables that impacted quality positively and negatively related to variables that impacted quality negatively, fixed.acidity itself didn’t have any discernible relationship with quality. Also, while all these variables showed a visible relationship, there is a huge overlap in data points for all of them across quality grades. This brought down the correlation of every variable with quality.

Then, I proceeded to analyze the distribution of variables to understand the best range of values for each variable. The histogram plots with all quality grades showed a good range. Then I went on to narrow down the ranges for quality grade 8 alone. However, due to the lack of data points, the variable ranges looked uniform. Because of this, I decided to analyze grades 7 and 8 together as a combined higher quality grade. This gave a clearer picture of good quality wine with enough data points. It also helped to narrow down the ranges of each individual variable for good quality wine.

After the univariate analysis I proceeded with multivariate analysis of the dataset. While the above variables showed visible relationships with quality, their correlations with quality were not strong enough. However, a few pairs of variables showed visible correlation in the scatter plots and correlation values. When I analyzed these pairs against quality, they still had significant overlap of data points between different quality grades. Thus, while we established linear relations by visualization, the model didn’t perform better.

It’d be good to get more data points for higher and lower quality wines and then analyze the relationships. Also, each variable seems to have a huge variance in values within the same grade. This led to a huge overlap of data points between different quality grades, making it difficult to discern a good predictive model. A more granular labeling of quality would help establish a better model.